Good practice for project organisation
Data organisation (and visualisation) in practice
Statistical testing
P-hacking
3 April 2025
10% assessment in class - THIS COUNTS TOWARDS YOUR COURSE GRADE
Select a small dataset produced in your lab.
Compare how well the dataset conforms to the practices we discuss today.
How could this dataset be analysed and visualised?
Prepare a brief (~5mins) presentation for next week.
Further details are on Blackboard.
Based on the papers:
Good enough practices in scientific computing:
Wilson G, Bryan J, Cranston K, Kitzes J, Nederbragt L, Teal TK. PLoS Comput Biol. 2017 Jun 22;13(6):e1005510.
doi: 10.1371/journal.pcbi.1005510
Ten Simple Rules for Reproducible Computational Research
Sandve GK, Nekrutenko A, Taylor J, Hovig E. PLoS Comput Biol. 2013 Oct;9(10):e1003285.
doi: 10.1371/journal.pcbi.1003285
Sections (we’ll cover the blue stuff):
Data Management
Software
Collaboration
Project Organization
Spend two minutes thinking about the answers to the following questions:
If you were abducted by aliens in the next 30 seconds, how easy would it be for someone else to carry on your research? (after we finished mourning your loss, of course).
How much time would you need to spend getting your work to the point where another human being could carry on with your project? What would this involve?
If your laptop/desktop was unexpectedly destroyed by evil anti-research forces, what impact would this have on your project? (it’s ok, we’ll buy you a shiny new one from our “anti-anti-research forces” insurance fund).
Create analysis-friendly data
Record all the steps used to process data.
Place a brief explanatory comment at the start of every program.
Decompose programs into functions.
Be ruthless about eliminating duplication.
Always search for well-maintained software libraries that do what you need.
Test libraries before relying on them.
Give functions and variables meaningful names.
Make dependencies and requirements explicit.
e.g. a requirements.txt file.

Your project is very likely to be part of a larger research programme, and the work you are doing has the potential to contribute to other projects in the future.
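In R, one lightweight way to record dependencies is to save the output of sessionInfo(); a minimal sketch (the filename is just an example, and tools such as renv offer fuller dependency management):

```r
# Save the R version and loaded package versions alongside the project
# (minimal sketch; "requirements.txt" is an example filename)
writeLines(capture.output(sessionInfo()), "requirements.txt")
```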
You are also collaborating with yourself: ask “present you” about your good friend “future you”.
Create an overview of your project.
README.txt or README.md in the top-level directory.

Create a shared “to-do” list for the project.
Decide on communication strategies.
Wikimedia Commons File: Exampleofdigitalhoarding cluttereddesktop001.jpg
Put each project in its own directory, which is named after the project.
Put text documents associated with the project in the doc directory.
Put raw data and metadata in a data directory and files generated during cleanup and analysis in a results directory.
Put project source code in the src directory.
Put external scripts or compiled programs in the bin directory.
Name all files to reflect their content or function.
|-- CITATION
|-- README
|-- LICENSE
|-- requirements.txt
|-- data
|   |-- birds_count_table.csv
|-- doc
|   |-- notebook.md
|   |-- manuscript.md
|   |-- changelog.txt
|-- results
|   |-- summarized_results.csv
|-- src
|   |-- sightings_analysis.py
|   |-- runall.py
Wilson G, Bryan J, Cranston K, Kitzes J, Nederbragt L, Teal TK. PLoS Comput Biol. 2017 Jun 22;13(6):e1005510.
doi: 10.1371/journal.pcbi.1005510
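The layout above can be created directly from R; a minimal sketch (the project name my_project is just a placeholder):

```r
# Create the standard project skeleton described above
project <- "my_project"  # placeholder name
for (d in c("data", "doc", "results", "src", "bin")) {
  dir.create(file.path(project, d), recursive = TRUE, showWarnings = FALSE)
}
# Top-level files (empty placeholders to be filled in)
for (f in c("CITATION", "README", "LICENSE", "requirements.txt")) {
  file.create(file.path(project, f))
}
```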
We’re going to use some really old data as an example:
We’ll start by looking at a fairly typical (Excel-based) approach to data organisation and analysis.
Often the data would be recorded in Excel, using a format somewhat like this:
Excel makes it easy to add summary data such as:
=AVERAGE()
=STDEV()
=COUNT()
SEM = \(\frac{s}{\sqrt N}\)

Often we want to plot this summary data.
YOU CAN MAKE THE PLOT RIGHT IN THE EXCEL SHEET!
How many did we just break?
.csv file (comma-separated values: still opens in Excel, but is text-based, so easy to open elsewhere).

Each column is a variable, each row is an observation.
Save as a .csv file, and store in a separate Data folder.
# Read the data
tooth_growth = read.csv('Data/tooth-growth-long-format.csv')
# Look at the first few rows
head(tooth_growth)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5
The dplyr package for R provides a nice way to generate data summaries.
# Load dplyr package
library(dplyr)
# Generate data summary object (tg_summary)
tg_summary = tooth_growth %>%
group_by(supp, dose) %>%
summarize( MeanLength = mean(len),
SD = sd(len),
N = n(),
SEM = SD / sqrt(N) )
What does this produce?
# Show the tg_summary object
tg_summary
## # A tibble: 6 × 6
## # Groups:   supp [2]
##   supp   dose MeanLength    SD     N   SEM
##   <chr> <dbl>      <dbl> <dbl> <int> <dbl>
## 1 OJ      0.5      13.2   4.46    10 1.41
## 2 OJ      1        22.7   3.91    10 1.24
## 3 OJ      2        26.1   2.66    10 0.840
## 4 VC      0.5       7.98  2.75    10 0.869
## 5 VC      1        16.8   2.52    10 0.795
## 6 VC      2        26.1   4.80    10 1.52
tg_summary %>% knitr::kable(., digits=2)
| supp | dose | MeanLength | SD | N | SEM |
|---|---|---|---|---|---|
| OJ | 0.5 | 13.23 | 4.46 | 10 | 1.41 |
| OJ | 1.0 | 22.70 | 3.91 | 10 | 1.24 |
| OJ | 2.0 | 26.06 | 2.66 | 10 | 0.84 |
| VC | 0.5 | 7.98 | 2.75 | 10 | 0.87 |
| VC | 1.0 | 16.77 | 2.52 | 10 | 0.80 |
| VC | 2.0 | 26.14 | 4.80 | 10 | 1.52 |
The ggplot2 package has become (by far) the most popular tool for producing publication-quality figures in R.

R code for a barplot with ggplot2:
# Load the ggplot2 package
library(ggplot2)

# Create a basic bar plot
ggplot(tg_summary, aes(x=as.factor(dose), y=MeanLength, fill=supp)) +
  geom_bar(stat="identity", position="dodge")
Breaking down the code:
ggplot takes the data object (tg_summary), converts dose to a factor, plots dose on the x-axis and MeanLength on the y-axis, and uses supp to define colours.

geom_bar generates the barplot, using the values from tg_summary (identity) and puts the bars next to each other (dodge).

ggplot(tg_summary, aes(x=as.factor(dose), y=MeanLength, fill=supp)) +
  geom_bar(stat="identity", position="dodge")
Without as.factor:

ggplot(tg_summary, aes(x=dose, y=MeanLength, fill=supp)) +
  geom_bar(stat="identity", position="dodge")
Without dodge:

ggplot(tg_summary, aes(x=as.factor(dose), y=MeanLength, fill=supp)) +
  geom_bar(stat="identity")
ggplot(tg_summary, aes(x=as.factor(dose), y=MeanLength, fill=supp)) +
geom_bar(stat="identity", position="dodge") +
labs(title="Tooth length per dose", x="Dose (mg)",
y = "Mean length", fill='Supplement')
ggplot(tg_summary, aes(x=as.factor(dose), y=MeanLength, fill=supp)) +
geom_bar(stat="identity", position="dodge") +
labs(title="Tooth length per dose", x="Dose (mg)",
y = "Mean length", fill='Supplement') +
geom_errorbar(aes(ymin=MeanLength-SEM, ymax=MeanLength+SEM),
position=position_dodge(0.9), width=0.2)
ggplot(tg_summary, aes(x=as.factor(dose), y=MeanLength, fill=supp)) +
labs(title="Tooth length per dose", x="Dose (mg)",
y = "Mean length", fill='Supplement') +
geom_errorbar(aes(ymin=MeanLength-SEM, ymax=MeanLength+SEM),
position=position_dodge(0.9), width=0.2) +
geom_bar(stat="identity", position="dodge")
ggplot(tooth_growth, aes(x=as.factor(dose), y=len, fill=supp)) +
labs(title="Tooth length per dose", x="Dose (mg)",
y = "Length", fill='Supplement') +
geom_boxplot(outlier.alpha = 0) +
geom_jitter(position = position_jitterdodge(jitter.width=0.12), alpha=0.7)
https://www.autodesk.com/research/publications/same-stats-different-graphs?s=03
Let’s start with a smaller data set: just the OJ vs VC data for a dose of 1mg.
# Read in a reduced data set
tg_1mg = read.csv('Data/tooth-growth-long-format-dose-1mg.csv')
# Show the first few rows
head(tg_1mg)
##    len supp
## 1 16.5   VC
## 2 16.5   VC
## 3 15.2   VC
## 4 17.3   VC
## 5 22.5   VC
## 6 17.3   VC
ggplot(tg_1mg, aes(x=supp, y=len)) +
  labs(title="Tooth length vs Supplement (1mg dose)", x="Supplement", y = "Length") +
  geom_boxplot(outlier.alpha = 0, width=0.3) +
  geom_jitter(width=0.12, alpha=0.7)
In R we can perform a t-test between OJ and VC mean lengths via:
t.test(len ~ supp, data = tg_1mg)
##
##  Welch Two Sample t-test
##
## data:  len by supp
## t = 4.0328, df = 15.358, p-value = 0.001038
## alternative hypothesis: true difference in means between group OJ and group VC is not equal to 0
## 95 percent confidence interval:
##  2.802148 9.057852
## sample estimates:
## mean in group OJ mean in group VC
##            22.70            16.77
The p-value is 0.00104, which suggests a difference between the means of this size is unlikely to occur by chance if H0 is true.
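As a sanity check, the Welch t statistic can be reproduced by hand from the group summaries in the table earlier (means, SDs and n = 10 per group for OJ and VC at the 1mg dose):

```r
# Welch t statistic from summary statistics (values from the summary table)
m_oj <- 22.70; s_oj <- 3.91; n_oj <- 10
m_vc <- 16.77; s_vc <- 2.52; n_vc <- 10
se_diff <- sqrt(s_oj^2 / n_oj + s_vc^2 / n_vc)  # standard error of the difference
t_stat  <- (m_oj - m_vc) / se_diff
round(t_stat, 2)  # ~4.03, matching the t = 4.0328 reported by t.test()
```

The small discrepancy is just rounding, since the table shows the means and SDs to two decimal places.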
A non-parametric alternative is the wilcox.test() function.

Stripcharts of the length distributions for each group:
Converted to ranks, the data look like this:
Wilcoxon test ignores the length distribution, and tests for rank-based differences.
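R's rank() function shows how the conversion works; a small sketch using the first few VC values shown earlier (ties share the average of their ranks):

```r
# Rank conversion, as used internally by the Wilcoxon test
lens <- c(16.5, 16.5, 15.2, 17.3)
rank(lens)  # the two tied 16.5 values share rank 2.5
```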
wilcox.test(len ~ supp, data = tg_1mg)
## Warning in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...): cannot
## compute exact p-value with ties
##
##  Wilcoxon rank sum test with continuity correction
##
## data:  len by supp
## W = 88.5, p-value = 0.00403
## alternative hypothesis: true location shift is not equal to 0
P-value (0.004) again suggests there is a significant length difference between the groups.
NB: the warning just means that an approximation is being used to calculate the p-value, as there are ties (repeated values) present in the data.
Let’s return to the full tooth length data set.
If we wanted to test for VC vs OJ at each dose level, we’d need to perform three hypothesis tests.
Note that there are better statistical ways to perform this analysis, but we’ll keep it simple.
# Create empty vector to fill with p-values
pval = c()
# Dose = 0.5
pval[1] = t.test(len ~ supp, data = tooth_growth %>% filter(dose==0.5) )$p.value
# Dose = 1
pval[2] = t.test(len ~ supp, data = tooth_growth %>% filter(dose==1) )$p.value
# Dose = 2
pval[3] = t.test(len ~ supp, data = tooth_growth %>% filter(dose==2) )$p.value
# Show p-values
names(pval) = c("Dose_0.5mg", "Dose_1.0mg", "Dose_2.0mg")
pval
##  Dose_0.5mg  Dose_1.0mg  Dose_2.0mg
## 0.006358607 0.001038376 0.963851589
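The three tests above can also be written as a single loop; a sketch using R's built-in ToothGrowth dataset, which appears to be the same data used here:

```r
# The same three t-tests as a loop over dose levels
doses <- c(0.5, 1, 2)
pval <- sapply(doses, function(d) {
  t.test(len ~ supp, data = subset(ToothGrowth, dose == d))$p.value
})
names(pval) <- paste0("Dose_", format(doses, nsmall = 1), "mg")
round(pval, 4)  # should match the p-values shown above
```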
To prevent large numbers of false positives when testing many hypotheses, we employ multiple testing correction. Two ways to think about it: adjust the p-values upwards (keeping the significance threshold at 0.05), or adjust the significance threshold downwards (keeping the raw p-values).
In Genome-Wide Association Studies (GWAS), the latter approach is generally used: typical genome-wide significance is set at 5×10⁻⁸.
This controls the Family-wise Error Rate (FWER) so that P(Type I error) < 0.05 across all tests being performed.
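The simplest correction, Bonferroni, just multiplies each p-value by the number of tests (capped at 1); a quick sketch using the p-values from above:

```r
# Bonferroni by hand: multiply by the number of tests, cap at 1
pval <- c(Dose_0.5mg = 0.006358607, Dose_1.0mg = 0.001038376, Dose_2.0mg = 0.963851589)
pmin(pval * length(pval), 1)  # matches the Bonferroni column below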
|  | RawP | FDR | Holm | Bonferroni |
|---|---|---|---|---|
| Dose_0.5mg | 0.0064 | 0.0095 | 0.0127 | 0.0191 |
| Dose_1.0mg | 0.0010 | 0.0031 | 0.0031 | 0.0031 |
| Dose_2.0mg | 0.9639 | 0.9639 | 0.9639 | 1.0000 |
The p.adjust function performs multiple testing correction in R:

# Bonferroni
p.adjust(pval, method="bonferroni")
##  Dose_0.5mg  Dose_1.0mg  Dose_2.0mg
## 0.019075820 0.003115128 1.000000000
# Holm (default for p.adjust)
p.adjust(pval, method="holm")
##  Dose_0.5mg  Dose_1.0mg  Dose_2.0mg
## 0.012717214 0.003115128 0.963851589
# FDR
p.adjust(pval, method="fdr")
##  Dose_0.5mg  Dose_1.0mg  Dose_2.0mg
## 0.009537910 0.003115128 0.963851589
P-hacking, also called “data dredging”, is the misuse of data analysis methods to identify statistically significant results, which increases the risk of false positive findings.
There are two principal causes of publication bias:
The “file drawer effect”: non-significant results are under-reported
And “P-hacking”: selective reporting of only significant results
Wikipedia page: Data dredging
XKCD wisdom: Significant
n = 20
(x = rnorm(n))
##  [1]  0.07050839  0.12928774  1.71506499  0.46091621 -1.26506123 -0.68685285
##  [7] -0.44566197  1.22408180  0.35981383  0.40077145  0.11068272 -0.55584113
## [13]  1.78691314  0.49785048 -1.96661716  0.70135590 -0.47279141 -1.06782371
## [19] -0.21797491 -1.02600445
(y = rnorm(n))
##  [1] -0.72889123 -0.62503927 -1.68669331  0.83778704  0.15337312 -1.13813694
##  [7]  1.25381492  0.42646422 -0.29507148  0.89512566  0.87813349  0.82158108
## [13]  0.68864025  0.55391765 -0.06191171 -0.30596266 -0.38047100 -0.69470698
## [19] -0.20791728 -1.26539635
t.test(x, y)
##
##  Welch Two Sample t-test
##
## data:  x and y
## t = 0.11133, df = 37.088, p-value = 0.912
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.5451517  0.6085496
## sample estimates:
##   mean of x   mean of y
## -0.01236911 -0.04406804
t.test(x, y, alternative = "less")
##
##  Welch Two Sample t-test
##
## data:  x and y
## t = 0.11133, df = 37.088, p-value = 0.544
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf 0.5120184
## sample estimates:
##   mean of x   mean of y
## -0.01236911 -0.04406804
t.test(x, y, alternative = "greater")
##
##  Welch Two Sample t-test
##
## data:  x and y
## t = 0.11133, df = 37.088, p-value = 0.456
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -0.4486206        Inf
## sample estimates:
##   mean of x   mean of y
## -0.01236911 -0.04406804
wilcox.test(x,y)
##
##  Wilcoxon rank sum exact test
##
## data:  x and y
## W = 197, p-value = 0.9467
## alternative hypothesis: true location shift is not equal to 0
cor.test (correlation)
ks.test (Kolmogorov–Smirnov)
kruskal.test (Kruskal–Wallis)
var.test (F test to compare variances)
anova (analysis of variance)
chisq.test (chi-squared)
shapiro.test (Shapiro–Wilk normality test)
for (l in 1:100){
y = rnorm(50)
x = rnorm(50)
p = t.test(x, y)$p.value
if(p < 0.05){
cat("Test [", l, "] is significant (p=", p, "). You can publish this one!\n")
}
}
## Test [ 45 ] is significant (p= 0.01118617 ). You can publish this one!
## Test [ 79 ] is significant (p= 0.01904146 ). You can publish this one!
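This is exactly why multiple testing correction matters: with 100 tests at a 0.05 threshold and no real effect, we expect around 100 × 0.05 = 5 “significant” results by chance. A sketch to check the false positive rate empirically (the seed is arbitrary, chosen for reproducibility):

```r
set.seed(42)  # arbitrary seed, for reproducibility
# 1000 t-tests on pure noise: the false positive rate should be close to 0.05
p <- replicate(1000, t.test(rnorm(50), rnorm(50))$p.value)
mean(p < 0.05)
```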